Student: Giovanni Rosati (part-time pacing)
Pacing: part-time
Scheduled project review date/time: Tuesday July 30, 2019 @ 7:45AM PCT
Instructor: Jeff Herman
Blog URL: https://medium.com/@gio_rosati/dealing-with-missing-data-17f8b5827664
GitHub Repository: https://github.com/giorosati/dsc-3-final-project-online-ds-pt-100118
Kelwin Fernandes, Jaime S. Cardoso, and Jessica Fernandes. 'Transfer Learning with Partial Observability Applied to Cervical Cancer Screening.' Iberian Conference on Pattern Recognition and Image Analysis. Springer International Publishing, 2017.
Dataset link: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29#
The objective of this project is to develop one or more machine learning models that can accurately predict the presence of cervical cancer from as many as 35 possible risk factors recorded for each woman in the dataset. A challenge when developing prediction models for this type of data is the balance between precision and recall:
Because not "capturing" even one case of cancer could result in death, the models should place emphasis on the recall score - it is preferable to not miss anyone with a significant probability of cancer even if that means "flagging" some patients as likely to have cancer that are actually cancer-free.
Another challenge with creating predictive models using this type of data is that positive cases (cancer is present) are typically a small proportion of the total patients. This unbalanced nature of the data can make prediction more difficult for machine learning models because there are fewer "True" (cancer is present) cases to learn from. With unbalanced data like this one can easily obtain a very high recall at the expense of precision, and vice-versa. For example, imagine a dataset of 1,000 patients where 20 individuals have cancer, a very realistic scenario:
The goal for a predictive model in a scenario like this is to identify a very high percentage of the cancer cases with as few as possible "false positives".
Each model's value will be evaluated by looking at recall first and foremost, and then the F1 score, a measurement that takes both precision and recall into account.
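As a sketch of the imbalance problem described above (the 1,000-patient / 20-case numbers come from the hypothetical scenario; everything else is illustrative), a model that never predicts cancer scores 98% accuracy yet has zero recall, which is exactly why recall and F1 are the headline metrics here:

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

y_true = [1] * 20 + [0] * 980   # 20 cancer cases among 1,000 patients
y_pred = [0] * 1000             # naive model: never predicts cancer

print('accuracy=%.2f' % accuracy_score(y_true, y_pred))  # 0.98
print('recall=%.2f' % recall_score(y_true, y_pred))      # 0.00
print('f1=%.2f' % f1_score(y_true, y_pred))              # 0.00
```

Accuracy alone makes the useless model look excellent; recall and F1 expose it immediately.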
The dataset was collected at 'Hospital Universitario de Caracas' in Caracas, Venezuela. It comprises 35 factors that include demographic information, habits, and historic medical records for 858 patients. Several patients decided not to answer some of the questions because of privacy concerns (missing values). The dataset comes from a research paper first published online in May 2017.
My approach will follow the OSEMN protocol:
The dataset will be obtained from the University of California at Irvine (UCI) Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29#
After an initial inspection of the dataset to identify missing values and any other obvious issues, I will develop and implement processes to minimize the negative impact of those missing values.
Further exploratory data analysis (EDA) will be performed to better understand the data, including the range and distribution of values within each factor and any possible correlations between factors and with the prediction target.
Multiple machine learning models will be tested. Additionally, some tuning of the available parameters for each model will be explored in the quest for the best performing model.
The prediction results will be evaluated and any models used will be compared.
Bringing in the required libraries, etc.
# ! conda list
# import of libraries used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
from scipy import stats
from scipy.stats import norm
from sklearn import preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from imblearn.over_sampling import SMOTE
from inspect import signature
from sklearn.tree import export_graphviz
from sklearn.externals.six import StringIO
from IPython.display import display, Image
import pydotplus
import xgboost
import shap
# magic commands
%matplotlib inline
# load the dataset as a Pandas dataframe
df_import = pd.read_csv('risk_factors_cervical_cancer.csv', na_values="?")
# look at the dataset overview
df_import.info()
Many factors have missing values.
# look at the count of unique values in each factor, including NaN's
for column in df_import.columns:
    print('Unique values and count in {}'.format(column))
    print(df_import[column].value_counts(dropna=False))
# view just the count of NaN's in each factor
df_import.isna().sum()
# copy the imported file to a new dataframe to retain the original import as a separate object
df = df_import.copy(deep=True)
The table below summarizes the range and missing percentage for each of the factors. Red text indicates a factor with missing values.
Because I do not want to lose any information that may be present in the records with missing values for these factors:
I chose to create new boolean factors to distinguish between records where the information was known and where no data was given. These factors will have a value of true (1) if the record had a value and a value of false (0) if the record was missing data for that factor:
# function to create a new boolean column in a df
# parameters: a dataframe and a column name
# returns a list with one value per record: 1 if the record has a value
# in the given column, 0 if the value tests true for pd.isna()
def new_bool(df, col_name):
    bool_list = []
    for index, row in df.iterrows():
        value = row[col_name]
        value_out = 1
        if pd.isna(value):
            value_out = 0
        bool_list.append(value_out)
    return bool_list
# create new factor 'is_number_partners_known'
df['is_number_partners_known'] = new_bool(df, 'Number of sexual partners')
# check if operation was successful
df['is_number_partners_known'].value_counts(dropna=False)
# create new factor 'is_first_intercourse_known'
df['is_first_intercourse_known'] = new_bool(df, 'First sexual intercourse')
# check if operation was successful
df['is_first_intercourse_known'].value_counts(dropna=False)
# create new factor 'is_number_pregnancies_known'
df['is_number_pregnancies_known'] = new_bool(df, 'Num of pregnancies')
# check if operation was successful
df['is_number_pregnancies_known'].value_counts(dropna=False)
Before I delete or fill missing values for any factors I will make another copy of the dataset and continue work on the new copy. This is useful if I ever want or need to revert to the dataset as it exists at this stage.
df2 = df.copy(deep=True)
These factors are missing values for over 90% of the records:
With so many missing values I would typically choose to delete these factors, but because there is a "parent" factor related to the two, I first want to examine the distribution of the "STDs: Number of diagnosis" factor.
df2['STDs: Number of diagnosis'].value_counts(dropna=False)
Over 90% of the records have a zero value for this factor, and I am assuming this means the patient had never been diagnosed with an STD. Because there may be some useful information in the remaining 71 records that have values of 1, 2, or 3, I will leave this factor as is. It appears that the 787 records with a value of zero for this factor are the same 787 records that are missing values for the two "time since..." factors.
# subset the records where the value is zero for 'STDs: Number of diagnosis'
value_zero = df2['STDs: Number of diagnosis'] == 0
temp = df2[value_zero]
print('time since first: ', temp['STDs: Time since first diagnosis'].value_counts(dropna=False))
print('time since last: ', temp['STDs: Time since last diagnosis'].value_counts(dropna=False))
This confirms that the 787 records missing values for the two "time since..." factors are the same records with a zero for "STDs: Number of diagnosis". I replaced the missing values in these two "time since..." factors with a zero.
df2['STDs: Time since first diagnosis'].fillna(0, inplace=True)
df2['STDs: Time since first diagnosis'].value_counts(dropna=False)
df2['STDs: Time since last diagnosis'].fillna(0, inplace=True)
df2['STDs: Time since last diagnosis'].value_counts(dropna=False)
For the remaining factors with missing values I created a function that displays a countplot, a boxplot, and some summary stats. This will help in choosing a fill value for each factor.
# function to display a countplot, boxplot, and summary stats for a factor
def countplot_boxplot(column, dataframe):
    fig = plt.figure(figsize=(15, 20))
    fig.suptitle(column, size=20)
    ax1 = fig.add_subplot(2, 1, 1)
    sns.countplot(dataframe[column])
    plt.xticks(rotation=45)
    ax2 = fig.add_subplot(2, 1, 2)
    sns.boxplot(dataframe[column])
    plt.xticks(rotation=45)
    plt.show()
    print('Min:', dataframe[column].min())
    print('Mean:', dataframe[column].mean())
    print('Median:', dataframe[column].median())
    print('Mode:', dataframe[column].mode()[0])
    print('Max:', dataframe[column].max())
    print('**********************')
    print('% of values missing:', (dataframe[column].isna().sum() / len(dataframe)) * 100)
I created three functions for the various ways I will deal with missing values:
# function to replace missing values with the median
def fillna_median(column, dataframe):
    dataframe[column].fillna(dataframe[column].median(), inplace=True)
    print(dataframe[column].value_counts(dropna=False))

# function to replace missing values with the mean
def fillna_mean(column, dataframe):
    dataframe[column].fillna(dataframe[column].mean(), inplace=True)
    print(dataframe[column].value_counts(dropna=False))

# function to replace missing values with a value provided
def fillna_w_value(value, column, dataframe):
    dataframe[column].fillna(value, inplace=True)
    print(dataframe[column].value_counts(dropna=False))
The three factors related to smoking are missing values in less than 2% of the records. I will replace the missing values in all three with a zero. I expect this change to have a minimal effect on the models.
col_list = ['Smokes', 'Smokes (years)', 'Smokes (packs/year)']
for col in col_list:
    fillna_w_value(0, col, df2)
Hormonal Contraceptives, IUD, and STDs are boolean factors with about 12-13% of missing values in each. I looked at the distribution of values in each:
col_list = ['Hormonal Contraceptives', 'IUD', 'STDs']
for col in col_list:
    print(col, ':')
    print(df2[col].value_counts(dropna=False))
    print("----------------")
I chose to fill the missing values in each factor with the value (0 or 1) having the highest frequency.
# replace more missing values and confirm results
fillna_w_value(1, 'Hormonal Contraceptives', df2)
fillna_w_value(0, 'IUD', df2)
fillna_w_value(0, 'STDs', df2)
The following factors have integer values and all have missing values in 1 to 12% of records:
I looked at the distribution of values in each of these factors before deciding how to fill the missing values.
countplot_boxplot('Number of sexual partners', df2)
The mean of 'Number of sexual partners' is pulled higher by some outliers but only 3% of values are missing. I will replace the missing values with 2.0 which is both the mode and the median value.
# replace with median
fillna_median('Number of sexual partners', df2)
countplot_boxplot('First sexual intercourse', df2)
Less than 1% of values are missing for this factor. I will replace those with the median value.
# replace with median
fillna_median('First sexual intercourse', df2)
countplot_boxplot('Num of pregnancies', df2)
About 6.5% of the values are missing and I will replace those values with the median.
# replace with median
fillna_median('Num of pregnancies', df2)
countplot_boxplot('Hormonal Contraceptives (years)', df2)
About 12.6% of the records are missing values for this factor and records with a zero value are by far the largest group. There are quite a few outliers, all falling in the higher value range. These high outliers pull the mean substantially higher than the median. An argument could be made to fill the missing values with either the mode (0) or the median (0.5). I chose to fill them with the mode (0).
# replace with zeros
fillna_w_value(0,'Hormonal Contraceptives (years)', df2)
countplot_boxplot('IUD (years)', df2)
This factor has over 13% missing values, and zero occurs in the overwhelming majority of records. I chose to fill the missing values with zero.
# replace with zeros
fillna_w_value(0, 'IUD (years)', df2)
countplot_boxplot('STDs (number)', df2)
I chose to replace the approximately 12% of missing values in this factor with zeros because zero is again the overwhelmingly most common value.
# replace with zeros
fillna_w_value(0, 'STDs (number)', df2)
There are 12 individual boolean factors, each for a different STD. Each has about 12% missing values. Because I replaced the missing values in the "STDs (number)" factor with zeros, I will also replace the missing values in each of these factors with zero.
# replace missing values with a zero
col_list = [
'STDs:condylomatosis',
'STDs:cervical condylomatosis',
'STDs:vaginal condylomatosis',
'STDs:vulvo-perineal condylomatosis',
'STDs:syphilis',
'STDs:pelvic inflammatory disease',
'STDs:genital herpes',
'STDs:molluscum contagiosum',
'STDs:AIDS',
'STDs:HIV',
'STDs:Hepatitis B',
'STDs:HPV']
for col in col_list:
    fillna_w_value(0, col, df2)
In the list above it is apparent that two of these 12 factors have no information:
I dropped them from the dataframe.
# drop useless factors
df2.drop(['STDs:cervical condylomatosis','STDs:AIDS'], axis=1, inplace=True)
# look for any remaining NaN's in the dataframe
df2.isna().sum()
I will first look at a heatmap to see if any factors show obvious collinearity.
# pull the target variable from the dataframe and use a heatmap to look at possible correlations between factors
temp_df = df2.drop('Biopsy', axis=1)
f,ax = plt.subplots(figsize=(20,20))
sns.heatmap(temp_df.loc[:,:].corr(), annot=True, cmap="Blues", fmt='.1f' )
plt.show()
Some factors show strong correlation but this is not surprising.
For example, "STDs", "STDs (number)", and "STDs: Number of diagnosis", all show signifigant correlations to the individual STD factors and this would be expected.
Additionally, the following factors all show some correlation with age:
It is not surprising that these factors would be correlated with a patient's age.
Next I looked at how the factors correlate with the target variable (Biopsy). I chose to look at the top 12 factors (ranked by correlation with Biopsy).
# look at a correlation matrix of the top 12 factors to the target variable: Biopsy
corrmat = df2.corr()
k = 12 #number of variables for heatmap
cols = corrmat.nlargest(k, 'Biopsy')['Biopsy'].index
cm = np.corrcoef(df2[cols].values.T)
plt.figure(figsize=(15,15))
sns.set(font_scale=2)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10},
yticklabels = cols.values, xticklabels = cols.values)
plt.show()
Schiller, Hinselmann, and Citology are factors representing medical diagnostic tools used to identify cervical cancer. It's not surprising they have the strongest correlations with Biopsy.
# make a backup copy of df2 at this stage of processing
df_backup = df2.copy()
First step: Separate the dataset into the predictive factors and the target variable
# create X and y
X = df2.drop(['Biopsy'], axis=1)
y = df2['Biopsy']
X.info()
y.head()
# create an 80/20 train/test split with a specified random value for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state= 10)
Next: Scaling and normalization. If there is a difference in the range of values between factors, many machine learning models will not properly compensate for those differences, and this can negatively impact a model's performance. None of the numeric factors in this dataset:
have a normal distribution; only "Age" is even remotely "normal". Accordingly, I chose min-max scaling to map all of these factors into a range from zero to 1. The scaler is fitted AFTER the train/test split so that the training data does not have the advantage of any information from the testing data.
# scale the continuous factors to the range [0, 1]
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
X_train = pd.DataFrame(minmax_scale.fit_transform(X_train), columns=X.columns)
# transform (not fit) the test set so no information leaks from it
X_test = pd.DataFrame(minmax_scale.transform(X_test), columns=X.columns)
Confirm that X and y for both the train and test sets have the same number of records.
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
Now look at the value counts for the target variable.
y_train.value_counts()
y_test.value_counts()
In both the test and train sets, the true value (1) is a small percentage of the total records. This creates an "unbalanced" dataset and makes prediction with machine learning models more difficult. Varying ways of addressing this scenario exist, and some machine learning models include parameters specifically designed to improve performance with unbalanced datasets. Libraries have also been developed that create new "fictitious" samples of an underrepresented class (or classes) based on the "real" examples from that class.
I will experiment with a few options to address the unbalanced data in some of the models.
I chose to explore the following models:
As mentioned above, one of the strategies that I will test is 'over-sampling' - creating new, fictitious samples such that the target variable in the training data is balanced between the two classes (True/False). The target variable in the original dataset ("Biopsy") is very unbalanced, with about 90% of the samples being False. Oversampling will be used to create new samples such that the training data target variable is equally distributed between True and False values. The oversampling library I will be using is 'SMOTE' from the imbalanced-learn project (https://imbalanced-learn.readthedocs.io/en/stable/index.html). SMOTE is an acronym for Synthetic Minority Over-sampling Technique.
# using SMOTE to create a balanced training dataset
sm = SMOTE(sampling_strategy='auto', random_state=10)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
X_train.shape
X_train_res.shape
y_train.shape
y_train_res.shape
unique_elements, counts_elements = np.unique(y_train_res, return_counts=True)
print("Frequency of unique values :")
print(np.asarray((unique_elements, counts_elements)))
Now I also have a balanced training dataset to test with (X_train_res, y_train_res)
Because this project is attempting to predict the presence of a life-threatening illness, it is preferable not to miss any positives (cancer present), even if that means more false positives (predicting that a patient has cancer when she does not).
Preference in scoring will be given to models with high recall (the percentage of patients with cancer that the model captures). A higher false positive rate (model predicted cancer, but no cancer present) will be tolerated to achieve higher recall.
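One way to express this preference in code (an illustrative sketch on synthetic data; the 0.2 threshold is hypothetical, not a value used later in this notebook) is to lower the probability cutoff for predicting the positive class below the default 0.5, trading precision for higher recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# synthetic unbalanced data: roughly 95% negative, 5% positive
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.95],
                                     random_state=10)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
probs = clf.predict_proba(X_demo)[:, 1]

for threshold in (0.5, 0.2):  # default cutoff vs. a lowered one
    preds = (probs >= threshold).astype(int)
    print('threshold=%.1f recall=%.2f precision=%.2f'
          % (threshold, recall_score(y_demo, preds),
             precision_score(y_demo, preds)))
```

Lowering the threshold can only keep or grow the set of predicted positives, so recall never decreases while precision typically drops.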
# create a function to display a graph of precision and recall and the scores I am interested in
def analysis(model, X_train, y_train):
    model.fit(X_train, y_train)
    # keep predicted probabilities for the positive outcome only
    probs = model.predict_proba(X_test)[:, 1]
    # predict class values
    preds = model.predict(X_test)
    # calculate precision-recall curve
    precision, recall, thresholds = precision_recall_curve(y_test, probs)
    # calculate average precision
    average_precision = average_precision_score(y_test, probs)
    # recall score for class 1 (predict that Biopsy is True)
    rs = recall_score(y_test, preds)
    # calculate F1 score
    f1 = f1_score(y_test, preds)
    # calculate precision-recall AUC
    auc_score = auc(recall, precision)
    # create chart; in matplotlib < 1.5, plt.fill_between lacks a 'step' argument
    step_kwargs = ({'step': 'post'}
                   if 'step' in signature(plt.fill_between).parameters
                   else {})
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
    # plot a "no skill" line
    plt.plot([0, 1], [0.5, 0.5], linestyle='--')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.ylim([0.0, 1.05])
    plt.xlim([0.0, 1.0])
    plt.title('Precision-Recall curve: Average Precision={0:0.3f}'.format(average_precision))
    plt.show()
    print('Classification Report:')
    print(classification_report(y_test, preds))
    print('f1=%.3f auc=%.3f recall=%.3f' % (f1, auc_score, rs))
# function to plot the importances of factors used in a model
def plot_feature_importances(model):
    n_features = X_train.shape[1]
    plt.figure(figsize=(15, 15))
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), X.columns)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
# logistic regression model with no adjustment for the unbalanced target classes
model = LogisticRegression(fit_intercept=False,
                           C=1e12,
                           class_weight=None,
                           solver='lbfgs',
                           max_iter=4000)
analysis(model, X_train, y_train)
An F1 score below 0.5 suggests the model is performing little better than guessing.
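To put "guessing" on a numeric footing, a chance-level baseline can be sketched with scikit-learn's DummyClassifier (illustrative synthetic 90/10 data, not the project dataset):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# synthetic unbalanced data: roughly 90% negative, 10% positive
X_demo, y_demo = make_classification(n_samples=1000, weights=[0.9],
                                     random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=10)
# 'stratified' guesses randomly in proportion to the training class frequencies
dummy = DummyClassifier(strategy='stratified', random_state=10).fit(X_tr, y_tr)
f1_chance = f1_score(y_te, dummy.predict(X_te))
print('chance-level f1=%.3f' % f1_chance)
```

On a split this unbalanced, random guessing lands well below an F1 of 0.5, which gives a concrete floor to compare each model against.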
I then tried this model using the class_weight parameter set to 'balanced' to compensate for the unbalanced data.
# logistic regression model using class_weight-balanced argument
model = LogisticRegression(fit_intercept = False,
C = 1e12,
class_weight='balanced',
solver='lbfgs',
max_iter = 4000)
analysis(model, X_train, y_train)
Using the 'balanced' option for the class_weight parameter resulted in even lower recall and F1 scores.
Next, try it with the resampled data.
# logistic regression model using class_weight-balanced argument
model = LogisticRegression(fit_intercept = False,
C = 1e12,
solver='lbfgs',
max_iter = 4000)
analysis(model, X_train_res, y_train_res)
Using the balanced dataset the recall improved substantially but the F1 score is still very poor and this model is no better than guessing.
K nearest neighbors models typically do not work well with unbalanced data like this. I will first try the model with n_neighbors=1, and then try it again with the synthetically balanced (resampled) dataset.
# KNN
model = KNeighborsClassifier(n_neighbors=1)
analysis(model, X_train, y_train)
As expected, this model performs very poorly. Again, as recall increases, precision plummets.
Now trying with the resampled data.
# KNN
model = KNeighborsClassifier(n_neighbors=1)
analysis(model, X_train_res, y_train_res)
A significant improvement, but still very poor performance.
I will employ grid search to experiment with different values for some of the parameters of a decision tree.
# create a Decision Tree
tree_clf = DecisionTreeClassifier(random_state=10)
# create a parameter grid
dt_param_grid = {
'class_weight': ['balanced'],
'criterion': ['gini', 'entropy'],
'max_depth': [2, 3, 4, 5, 6, 7]}
dt_grid_search = GridSearchCV(tree_clf,
param_grid = dt_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
dt_grid_search.fit(X_train, y_train)
dt_gs_training_score = np.mean(dt_grid_search.cv_results_['mean_train_score'])
dt_gs_testing_score = dt_grid_search.score(X_test, y_test)
print("Mean Training Score: {:.4}%".format(dt_gs_training_score * 100))
print("Mean Testing Score: {:.4}%".format(dt_gs_testing_score * 100))
print("Best Parameter Combination Found During Grid Search:")
dt_grid_search.best_params_
Now I will create a decision tree using the best values determined by grid search.
tree = DecisionTreeClassifier(class_weight = 'balanced', criterion = "gini", max_depth = 2, random_state=10)
analysis(tree, X_train, y_train)
Best performance so far.
Look at the feature importances for this tree.
plot_feature_importances(tree)
plt.show()
Now try using the resampled data.
# use grid search again, but with the resampled data
# create a Decision Tree
tree_clf = DecisionTreeClassifier(random_state=10)
# create a parameter grid
dt_param_grid = {
'class_weight': [None],
'criterion': ['gini', 'entropy'],
'max_depth': [2, 3, 4, 5, 6, 7]}
dt_grid_search = GridSearchCV(tree_clf,
param_grid = dt_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
dt_grid_search.fit(X_train_res, y_train_res)
dt_gs_training_score = np.mean(dt_grid_search.cv_results_['mean_train_score'])
dt_gs_testing_score = dt_grid_search.score(X_test, y_test)
print("Mean Training Score: {:.4}%".format(dt_gs_training_score * 100))
print("Mean Testing Score: {:.4}%".format(dt_gs_testing_score * 100))
print("Best Parameter Combination Found During Grid Search:")
dt_grid_search.best_params_
Create a tree using these parameters
tree = DecisionTreeClassifier(criterion = "entropy", max_depth = 6, random_state=10)
analysis(tree, X_train_res, y_train_res)
Both recall and F1 scores are lower using the resampled data.
I want to see what features are most important in this decision tree.
It's not surprising that "Schilller" is by far the most important feature as it represents a diagnostic test to determine the presence of this cancer.
I used grid search to search for the best parameters when creating a random forest. I chose "recall" for the scoring measurement and 3 for cross-validation folds.
# use grid search for the random forest classifier, use recall as the factor to optimize
forest = RandomForestClassifier(random_state=10, n_jobs=-1)
forest_param_grid = {
'class_weight': ['balanced'],
'criterion': ['gini', 'entropy' ],
'max_depth': [2, 3, 4, 5, 6, 7, 8],
'n_estimators': [20, 40, 50, 60, 80, 100, 200]}
forest_grid_search = GridSearchCV(forest,
param_grid = forest_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
import time
start = time.time()
forest_grid_search.fit(X_train, y_train)
print("Testing Accuracy: {:.4}%".format(forest_grid_search.best_score_ * 100))
print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
print("")
print("Optimal Parameters: {}".format(forest_grid_search.best_params_))
Based on these results, I attempted to fine-tune some of the parameters by creating a new grid search.
forest_param_grid = {
'class_weight': ['balanced'],
'criterion': ['gini'],
'max_depth': [2, 3, 4],
'n_estimators': [10, 15, 20, 25, 30]}
forest_grid_search = GridSearchCV(forest,
param_grid = forest_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
import time
start = time.time()
forest_grid_search.fit(X_train, y_train)
print("Testing Accuracy: {:.4}%".format(forest_grid_search.best_score_ * 100))
print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
print("")
print("Optimal Parameters: {}".format(forest_grid_search.best_params_))
Now using the values obtained I will create a new random forest.
forest = RandomForestClassifier(n_estimators=10,
criterion='gini',
max_depth=2,
class_weight='balanced',
random_state=10)
analysis(forest, X_train, y_train)
This model improved the recall and F1 scores significantly. Let's look at the feature importances in this model.
plot_feature_importances(forest)
plt.show()
This model did a much better job identifying the factors that showed correlation to Biopsy in the heatmap. Taking a look at how the trees varied in terms of feature importances:
# Plot the feature importances of the forest
plt.figure(figsize=(20,20))
plt.title("Feature importances")
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
axis=0)
labels = np.array(X.columns)
label_importance_std = pd.DataFrame(columns=['Factor','Importance', 'STD'])
label_importance_std['Factor'] = labels
label_importance_std['Importance'] = importances
label_importance_std['STD'] = std
label_importance_std.sort_values('Importance', inplace=True, ascending=False)
#make the graph
plt.bar(range(X_train.shape[1]), label_importance_std['Importance'],
color="r", yerr=label_importance_std['STD'], align="center")
plt.xticks(range(X_train.shape[1]), label_importance_std['Factor'], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.show()
The solid black line in the middle of the red bars indicates the standard deviation for each factor's importance from all the trees.
Now I will try a random forest model using the resampled data.
# use grid search for the random forest classifier, use recall as the factor to optimize - resampled, balanced data
forest = RandomForestClassifier(random_state=10, n_jobs=-1)
forest_param_grid = {
'class_weight': [None],
'criterion': ['gini', 'entropy'],
'max_depth': [5, 6, 7, 8, 9, 10, 11, 12],
'n_estimators': [10, 20, 30, 40, 50, 60]}
forest_grid_search = GridSearchCV(forest,
param_grid = forest_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
import time
start = time.time()
forest_grid_search.fit(X_train_res, y_train_res)
print("Testing Accuracy: {:.4}%".format(forest_grid_search.best_score_ * 100))
print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
print("")
print("Optimal Parameters: {}".format(forest_grid_search.best_params_))
# refine the gridsearch based on the results of this test
forest = RandomForestClassifier(random_state=10, n_jobs=-1)
forest_param_grid = {
'class_weight': [None],
'criterion': ['gini'],
'max_depth': [7,8,9],
'n_estimators': [35, 40, 45]}
forest_grid_search = GridSearchCV(forest,
param_grid = forest_param_grid,
scoring = 'recall',
cv=3,
return_train_score=True)
import time
start = time.time()
forest_grid_search.fit(X_train_res, y_train_res)
print("Testing Accuracy: {:.4}%".format(forest_grid_search.best_score_ * 100))
print("Total Runtime for Grid Search on Random Forest Classifier: {:.4} seconds".format(time.time() - start))
print("")
print("Optimal Parameters: {}".format(forest_grid_search.best_params_))
#create the model using the best parameters
forest = RandomForestClassifier(criterion='gini',
max_depth=8,
n_estimators=40,
random_state=10,
n_jobs=-1)
analysis(forest, X_train_res, y_train_res)
Again the recall and F1 scores are lower when using the resampled data.
#SVM with class_weight='balanced'
clf = svm.SVC(cache_size=1000, gamma='scale', class_weight='balanced', probability=True)
analysis(clf, X_train, y_train)
Both recall and F1 scores improved markedly when using 'balanced' as the class_weight parameter. Now try with the resampled data.
#try with resampled data
clf = svm.SVC(cache_size=1000, gamma='scale', probability=True)
analysis(clf, X_train_res, y_train_res)
Again, the model performs worse when using the resampled data.
Of the five models explored, the Random Forest was the most successful when ranked by the recall and F1 measurements, in that order.
A recall score of 0.875 was reached by three models:
As anticipated, KNN performed the poorest, especially regarding recall.
Creating fictitious data samples was quite helpful with the logistic regression and KNN models, while the remaining models performed better with the original data and their built-in parameters for balancing the classes.
Let's look more deeply at the two best performing models.
# look at the best performing decision tree
best_tree = DecisionTreeClassifier(class_weight = 'balanced', criterion = "gini", max_depth = 2, random_state=10)
best_tree.fit(X_train, y_train)
dt_feature_names = list(X_test.columns)
# dt_target_names = [str(s) for s in Y_test.unique()]
dt_target_names = ['False', 'True']
export_graphviz(best_tree, out_file='tree.dot',
                feature_names=dt_feature_names, class_names=dt_target_names,
                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())
This graph shows how the tree decides how to classify each patient. It uses only three variables:
We can see from this flow chart that if a woman's Schiller result was False ('0'), and her Dx:CIN results were also False, the model predicts Biopsy to be False.
Looking at the right branch of the tree, where a woman's Schiller test was True ('1'), the only way the model would predict False is if the woman was in the bottom 5% of the age range, approximately under 17 years old. Conversely, if a woman's Schiller result was positive and her age was 17 or older, the model always predicts True for Biopsy.
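The split variables and thresholds described above can also be read programmatically from a fitted tree via its tree_ attribute. The sketch below fits a depth-2 tree on synthetic data (where the label is driven entirely by feature 0), then prints each internal node's split; the notebook's real tree was of course fit on X_train/y_train.

```python
# Sketch: reading split features and thresholds from a fitted decision tree
# (synthetic data for illustration only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(10)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)          # class determined entirely by feature 0

toy_tree = DecisionTreeClassifier(max_depth=2, random_state=10).fit(X, y)
t = toy_tree.tree_
for node in range(t.node_count):
    if t.children_left[node] != -1:      # -1 marks a leaf node
        print(f"node {node}: split on feature {t.feature[node]} "
              f"at threshold {t.threshold[node]:.3f}")
```

Here the root split lands on feature 0 near 0.5, mirroring how the notebook's tree exposes the Schiller split and the 0.049 age threshold.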
Now let's look at an individual record's predictions.
# look at test record # 71
X_test.iloc[71,:]
Record #71 had a Schiller value of 1 and an Age of 0.077. Because Age is higher than the decision threshold for that factor (0.049), the model predicted True for Biopsy.
# https://github.com/slundberg/shap
# load JS visualization code to notebook
shap.initjs()
# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(best_tree)
shap_values = explainer.shap_values(X_test)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[1], shap_values[1][71,:], X_test.iloc[71,:], link="logit")
# were there any X_test records where Schiller was 1 but age was below the threshold?
temp = X_test.loc[X_test["Schiller"] == 1]
temp[temp['Age'] < 0.049]
There were no patients in the test data with a '1' for Schiller AND an age value less than 0.049, which is why the SHAP graph above shows only Schiller.
# were there any X_train records where Schiller was 1 but age was below the threshold?
temp = X_train.loc[X_train["Schiller"] == 1]
temp[temp['Age'] < 0.049]
Yes. The training data had 3 records where the patient's Schiller value was True, yet their age was below the threshold used in the decision tree.
The actual Biopsy value for this record is:
y_test.iloc[71]
The SHAP graph below shows how important the Schiller variable was to this model's predictions.
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test, link="logit")
forest = RandomForestClassifier(n_estimators=10,
                                criterion='gini',
                                max_depth=2,
                                class_weight='balanced',
                                random_state=10)
forest.fit(X_train, y_train)
forest_preds = forest.predict(X_test)
# load JS visualization code to notebook
shap.initjs()
# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(forest)
shap_values = explainer.shap_values(X_test)
# shap_values has two elements, one per class (0, 1)
# each is an array of 172 records x 36 columns
# each column contains the + or - "pull" of that factor on the record's prediction
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value[0], shap_values[0][71,:], X_test.iloc[71,:], link="logit")
# explainer.expected_value = [0.5125093385577202, 0.4874906614422798]
forest_preds[71]
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test, link="logit")
def output_tree(tree):
    export_graphviz(tree, out_file='tree.dot', feature_names=dt_feature_names,
                    class_names=dt_target_names, filled=True, rounded=True,
                    special_characters=True)
    graph = pydotplus.graph_from_dot_file('tree.dot')
    return graph.create_png()
dt_feature_names = list(X_test.columns)
dt_target_names = ['False', 'True']
for i in range(len(forest.estimators_)):
    tree = forest.estimators_[i]
    im = output_tree(tree)
    print()
    print(' **************** Decision Tree # ', i+1, ' ********************')
    print()
    display(Image(im))
    print()
    print()
    print()